Tested Q_rsqrt on Apple M4 (Mac mini) and Zen 3 (Ryzen 5800HS / WSL2). On M4, -O2 already rewrites 1/sqrtf to frsqrte and ties Q_rsqrt; on x86, clang needs -ffast-math or is left with a 12x gap. Hand-written NEON/SSE wrappers turn out slower. Also covers the error after 0, 1, and 2 Newton iterations and the Lomont constant.
oMLX 0.3.9.dev2 release notes, read from the angle of using Codex/Copilot with local LLMs on a Mac: Gemma 4 VLM MTP, DFlash, omlx launch copilot, and SSD KV cache, and what each changes for agent workflows.
Tested Klein 9B + 9B NSFW LoRA on M1 Max 64GB via mflux 0.17.5: 1m51s/512, 5m37s/1024 q4, 224/224 LoRA keys match, NSFW prompts uncensored, Japanese subjects work with helper tokens.
Klein 4B / 9B / Base LoRAs aren't cross-compatible: a 9B NSFW LoRA throws 'lora key not loaded' on mflux's 4B path. Covers the variant map, what mflux runs today, and where the working hands-on test lives.
Three local image generation engines (WAI-Anima, WAI-IL/SDXL, FLUX.2 Klein 4B) tied together by a thin FastAPI wrapper that takes Japanese prompts. Ollama (gemma3:12b) handles JP→EN, ComfyUI workflows are built on the fly in Python, FLUX.2 runs as an mflux subprocess, and the whole thing is reachable from an iPhone over Tailscale.
Hands-on log of building the DEV article's PDF RAG on M1 Max 64GB, extending it to images with CLIP, and getting it working in Japanese with bge-m3 + Qwen3.6 35B. Documents the modality gap, the dual inference server crash, and LLM-jp 4-8B's empty chat template silently dropping the system role.
A hands-on log of running Qwen-Scope's Sparse Autoencoder locally on M1 Max 64GB with Qwen3-8B-Base, extracting feature IDs that discriminate between Japanese, English, code, and Chinese from a single middle layer.
Hands-on benchmark of FLUX.2 Klein 4B on M1 Max 64GB using mflux (MLX) and iris.c (pure C + Metal). A counterpoint to Pruna AI's H100-only tutorial, measuring how fast Apple Silicon actually gets there.
After Xiaomi MiMo-V2.5's weights went public, I checked whether it runs on Mac/ROCm or on cloud GPU (RunPod/GCE). It's still rough on local hardware, but RunPod's 4x H200 runs it for ~$14/hr and GCE Spot H100 brings it down to ~$1.6/hr.
Confirmed SeeSee21/Z-Anime is a full fine-tune of Z-Image Base, then ran the AIO version on local ComfyUI on an M1 Max 64GB to verify t2i, i2i, and how NSFW prompts pass through.
A verification log for converting color anime-style AI illustrations to manga-style monochrome. AI re-generation approaches suffer from either color leakage or face drift, while purely deterministic local processing looks mechanical. Outlines the next directions to try: a grayscale-only LoRA on Anima, and See-through for part decomposition before mechanical composition.